Collaborating Authors

 Clifton


Robust Bayesian Optimisation with Unbounded Corruptions

Ezzerg, Abdelhamid, Bogunovic, Ilija, Knoblauch, Jeremias

arXiv.org Machine Learning

Bayesian Optimization is critically vulnerable to extreme outliers. Existing provably robust methods typically assume a bounded cumulative corruption budget, which makes them defenseless against even a single corruption of sufficient magnitude. To address this, we introduce a new adversary whose budget is bounded only in the frequency of corruptions, not in their magnitude. We then derive RCGP-UCB, an algorithm coupling the famous upper confidence bound (UCB) approach with a Robust Conjugate Gaussian Process (RCGP). We present stable and adaptive versions of RCGP-UCB, and prove that they achieve sublinear regret in the presence of up to $O(T^{1/2})$ and $O(T^{1/3})$ corruptions with possibly infinite magnitude. This robustness comes at near-zero cost: without outliers, RCGP-UCB's regret bounds match those of the standard GP-UCB algorithm.
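For readers unfamiliar with the UCB approach mentioned above, the following is a minimal sketch of the standard GP-UCB selection rule only, not of RCGP-UCB itself. It assumes the posterior mean and standard deviation at each candidate point have already been computed by some surrogate model; the function name `ucb_select`, the exploration weight `beta`, and the toy numbers are all illustrative assumptions, not taken from the paper.

```python
import math


def ucb_select(candidates, mean, std, beta):
    """Pick the candidate maximising mean + sqrt(beta) * std (standard UCB rule).

    `mean` and `std` are assumed to be posterior summaries from a surrogate
    model (for example a Gaussian process), one value per candidate.
    """
    scores = [m + math.sqrt(beta) * s for m, s in zip(mean, std)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy posterior summaries (made-up numbers for illustration only).
candidates = [0.0, 0.5, 1.0]
mean = [0.2, 0.5, 0.4]
std = [0.05, 0.1, 0.4]

explore_choice, _ = ucb_select(candidates, mean, std, beta=4.0)
exploit_choice, _ = ucb_select(candidates, mean, std, beta=0.0)
```

With a large `beta` the rule favours the uncertain candidate at 1.0; with `beta=0` it reduces to picking the highest posterior mean, at 0.5.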


Ontology Creation and Management Tools: the Case of Anatomical Connectivity

Kokash, Natallia, de Bono, Bernard, Gillespie, Tom

arXiv.org Artificial Intelligence

Ontologies are essential for developing standardized vocabularies and defining relationships that help describe and interpret data from diverse sources. They are crucial for achieving semantic interoperability in many domains, allowing different systems to exchange data with a consistent and shared meaning. Ontologies are extensively used in biological and biomedical research (Hoehndorf et al., 2015; Antezana et al., 2009), due to their ability to: provide standard identifiers for classes and relationships representing complex phenomena; include metadata to clarify the intended meaning of classes and relationships; include machine-readable definitions that allow computational access to class properties and relationships; standardize vocabulary across multiple data sources. Ontology-based data integration plays a vital role in neuroscience, where researchers synthesize knowledge across physiology, anatomy, molecular and developmental biology, cytology, and mathematical modeling to support accurate data representation, analysis, and simulation. A common challenge for many large neuroscience projects is the integration of data across a wide diversity of species, spatial resolutions, and temporal scales.


In silico study on the cytotoxicity against Hela cancer cells of xanthones bioactive compounds from Garcinia cowa: QSAR based on Graph Deep Learning, Network Pharmacology, and Molecular Docking

Son, Nguyen Manh, Vang, Pham Huu, Dung, Nguyen Thi, Thao, Nguyen Manh Ha. Ta Thi, Thuy, Tran Thi Thu, Giang, Phan Minh

arXiv.org Artificial Intelligence

Institute of Natural Products Chemistry, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet, Nghia Do, Cau Giay, Hanoi, Vietnam

Abstract: Cancer is recognized as a complex group of diseases, contributing to the highest global mortality rates, with increasing prevalence and a trend toward affecting younger populations. It is characterized by uncontrolled proliferation of abnormal cells, invasion of adjacent tissues, and metastasis to distant organs. Garcinia cowa, a traditional medicinal plant widely used in Southeast Asia, including Vietnam, is used to treat fever, cough, indigestion, and parasitic diseases, and as a laxative. Numerous xanthone compounds isolated from this species exhibit a broad spectrum of biological activities, with some showing promise as anti-cancer and antimalarial agents. Network pharmacology analysis identified key bioactive compounds (Rubraxanthone, Garcinone D, Norcowanin, Cowanol, and Cowaxanthone) alongside their primary protein targets (TNF, CTNNB1, SRC, NFKB1, and MTOR), providing critical insights into the molecular mechanisms underlying their anti-cancer effects. The Graph Attention Network algorithm demonstrated superior predictive performance, achieving an R of 0.98 and an RMSE of 0.02 after data augmentation, highlighting its accuracy in predicting pIC50 values for xanthone-based compounds. Additionally, molecular docking revealed MTOR as a potential target through which compounds from Garcinia cowa induce cytotoxicity in HeLa cancer cells.

Keywords: Garcinia cowa, HeLa, Network pharmacology, Graph neural network, Molecular docking

I. Introduction

Cancer is a complex group of diseases and one of the leading causes of mortality worldwide, characterized by the uncontrolled proliferation of abnormal cells, the ability to invade adjacent tissues, and metastasis to distant organs in the body [1, 2].


Sequence-based protein-protein interaction prediction and its applications in drug discovery

Charih, François, Green, James R., Biggar, Kyle K.

arXiv.org Artificial Intelligence

Aberrant protein-protein interactions (PPIs) underpin a plethora of human diseases, and disruption of these harmful interactions constitutes a compelling treatment avenue. Advances in computational approaches to PPI prediction have closely followed progress in deep learning and natural language processing. In this review, we outline the state of the art in sequence-based PPI prediction methods and explore their impact on target identification and drug discovery. We begin with an overview of commonly used training data sources and techniques used to curate these data to enhance the quality of the training set. Subsequently, we survey various PPI predictor types, including traditional similarity-based approaches, and deep learning-based approaches with a particular emphasis on the transformer architecture. Finally, we provide examples of PPI prediction in systems-level proteomics analyses, target identification, and design of therapeutic peptides and antibodies. We also take the opportunity to showcase the potential of PPI-aware drug discovery models in accelerating therapeutic development.


Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence

Sun, Yingying, A, Jun, Liu, Zhiwei, Sun, Rui, Qian, Liujia, Payne, Samuel H., Bittremieux, Wout, Ralser, Markus, Li, Chen, Chen, Yi, Dong, Zhen, Perez-Riverol, Yasset, Khan, Asif, Sander, Chris, Aebersold, Ruedi, Vizcaíno, Juan Antonio, Krieger, Jonathan R, Yao, Jianhua, Wen, Han, Zhang, Linfeng, Zhu, Yunping, Xuan, Yue, Sun, Benjamin Boyang, Qiao, Liang, Hermjakob, Henning, Tang, Haixu, Gao, Huanhuan, Deng, Yamin, Zhong, Qing, Chang, Cheng, Bandeira, Nuno, Li, Ming, E, Weinan, Sun, Siqi, Yang, Yuedong, Omenn, Gilbert S., Zhang, Yue, Xu, Ping, Fu, Yan, Liu, Xiaowen, Overall, Christopher M., Wang, Yu, Deutsch, Eric W., Chen, Luonan, Cox, Jürgen, Demichev, Vadim, He, Fuchu, Huang, Jiaxing, Jin, Huilin, Liu, Chao, Li, Nan, Luan, Zhongzhi, Song, Jiangning, Yu, Kaicheng, Wan, Wanggen, Wang, Tai, Zhang, Kang, Zhang, Le, Bell, Peter A., Mann, Matthias, Zhang, Bing, Guo, Tiannan

arXiv.org Artificial Intelligence

Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.


Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches

Liu, Xuefeng, Jiang, Songhao, Duan, Xiaotian, Vasan, Archit, Liu, Chong, Tien, Chih-chan, Ma, Heng, Brettin, Thomas, Xia, Fangfang, Foster, Ian T., Stevens, Rick L.

arXiv.org Machine Learning

Protein-ligand binding [Clyde et al., 2023] refers to the process, shown in Figure 1, by which ligands (usually small molecules, ions, or proteins) generate signals by binding to the active sites of target proteins through intermolecular forces. This binding typically changes the conformation of target proteins, which then results in the realization, modulation, or alteration of protein functions. Therefore, protein-ligand binding plays a central role in most, if not all, important life processes. For example, oxygen molecules are bound and carried through the human body by proteins like hemoglobin, and then utilized for energy production, while nonsteroidal anti-inflammatory drugs (NSAIDs) like ibuprofen work by inhibiting the functionality of the cyclooxygenase (COX) enzyme, thus reducing the release of pain-causing substances in the body. The concept and importance of binding affinity prediction were first addressed in Böhm [1994]: given the 3D structures of a target protein and a potential ligand, the objective is to predict the binding constant of such a complex, along with the most probable binding pose candidates. The prediction of the binding site (the set of protein residues that have at least one non-hydrogen atom within 4.0 Å of a ligand's non-hydrogen atom [Khazanov and Carlson, 2013]) and affinity (binding constants such as inhibition or dissociation constants, or the concentration at 50% inhibition) are usually divided into two separate but related stages [Ballester and Mitchell, 2010a]. One notable motivation for constructing a good binding affinity predictor (or scoring function, as it is called in some earlier work) is the essential role that it plays in drug discovery [Liu et al., 2023, 2024a] and virtual screening [Meng et al., 2011, Pinzi and Rastelli, 2019, Sadybekov and Katritch, 2023]. Traditional drug discovery essentially involves a process of trial and error.
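The binding-site definition quoted above (residues with at least one non-hydrogen atom within 4.0 Å of a ligand's non-hydrogen atom) can be sketched in a few lines. This is an illustrative sketch only: the tuple-based atom representation, the function name `binding_site`, the inclusive `<=` comparison at the cutoff, and the toy coordinates are all assumptions made for this example, not part of the cited work.

```python
import math


def binding_site(protein_atoms, ligand_atoms, cutoff=4.0):
    """Return sorted residue ids with a heavy atom within `cutoff` Å of a ligand heavy atom.

    protein_atoms: iterable of (residue_id, element, (x, y, z))
    ligand_atoms:  iterable of (element, (x, y, z))
    Treating the cutoff as inclusive is an assumption of this sketch.
    """
    # Keep only non-hydrogen ligand atoms.
    ligand_heavy = [xyz for element, xyz in ligand_atoms if element != "H"]
    site = set()
    for residue_id, element, xyz in protein_atoms:
        # Skip hydrogens and residues already known to be in the site.
        if element == "H" or residue_id in site:
            continue
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_heavy):
            site.add(residue_id)
    return sorted(site)


# Toy coordinates in Å (made up for illustration).
protein_atoms = [
    ("ALA1", "C", (0.0, 0.0, 0.0)),
    ("ALA1", "H", (0.0, 1.0, 0.0)),
    ("GLY2", "N", (10.0, 0.0, 0.0)),
    ("SER3", "H", (3.0, 0.5, 0.0)),  # only a hydrogen is close: does not count
    ("SER3", "O", (8.0, 0.0, 0.0)),
]
ligand_atoms = [("O", (3.0, 0.0, 0.0)), ("H", (9.5, 0.0, 0.0))]

site = binding_site(protein_atoms, ligand_atoms)
```

In this toy case only `ALA1` qualifies: `SER3` and `GLY2` are close to the ligand only through hydrogen atoms, which the definition excludes on both sides.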


Computing in the Life Sciences: From Early Algorithms to Modern AI

Donkor, Samuel A., Walsh, Matthew E., Titus, Alexander J.

arXiv.org Artificial Intelligence

Computing in the life sciences has undergone a transformative evolution, from early computational models in the 1950s to the applications of artificial intelligence (AI) and machine learning (ML) seen today. This paper highlights key milestones and technological advancements through the historical development of computing in the life sciences. The discussion includes the inception of computational models for biological processes, the advent of bioinformatics tools, and the integration of AI/ML in modern life sciences research. Attention is given to AI-enabled tools used in the life sciences, such as scientific large language models and bio-AI tools, examining their capabilities, limitations, and impact on biological risk. This paper seeks to clarify and establish essential terminology and concepts to ensure informed decision-making and effective communication across disciplines. The views and opinions expressed within this manuscript are those of the authors and do not necessarily reflect the views and opinions of any organization the authors are affiliated with.


FraGNNet: A Deep Probabilistic Model for Mass Spectrum Prediction

Young, Adamo, Wang, Fei, Wishart, David, Wang, Bo, Röst, Hannes, Greiner, Russ

arXiv.org Artificial Intelligence

The process of identifying a compound from its mass spectrum is a critical step in the analysis of complex mixtures. Typical solutions for the mass spectrum to compound (MS2C) problem involve matching the unknown spectrum against a library of known spectrum-molecule pairs, an approach that is limited by incomplete library coverage. Compound to mass spectrum (C2MS) models can improve retrieval rates by augmenting real libraries with predicted spectra. Unfortunately, many existing C2MS models suffer from problems with prediction resolution, scalability, or interpretability. We develop a new probabilistic method for C2MS prediction, FraGNNet, that can efficiently and accurately predict high-resolution spectra. FraGNNet uses a structured latent space to provide insight into the underlying processes that define the spectrum. Our model achieves state-of-the-art performance in terms of prediction error, and surpasses existing C2MS models as a tool for retrieval-based MS2C.
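The library-matching approach to MS2C described above can be illustrated with a small sketch. It assumes spectra are lists of (m/z, intensity) peaks, bins them at a fixed width, and ranks library entries by cosine similarity; cosine similarity on binned peaks is one common scoring choice and is an assumption here, not necessarily the one used by the authors. The compound names and peak values are made up, and library entries could equally be measured or model-predicted spectra.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_library(query, library, bin_width=1.0):
    """Return (score, name) pairs sorted from best to worst match."""
    q = bin_spectrum(query, bin_width)
    scores = [
        (cosine(q, bin_spectrum(peaks, bin_width)), name)
        for name, peaks in library.items()
    ]
    return sorted(scores, reverse=True)


# Hypothetical library of spectrum-molecule pairs (toy values).
library = {
    "compound_A": [(50.1, 12.0), (77.0, 95.0), (105.3, 35.0)],
    "compound_B": [(43.0, 100.0), (58.1, 60.0), (77.2, 5.0)],
}
query = [(50.2, 10.0), (77.1, 100.0), (105.0, 40.0)]

ranking = rank_library(query, library)
best_match = ranking[0][1]
```

A query for a compound absent from the library can only return a wrong best match, which is the coverage limitation that augmenting the library with predicted spectra is meant to address.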


Machine learning applied to omics data

Calviño, Aida, Moreno-Ribera, Almudena, Pineda, Silvia

arXiv.org Artificial Intelligence

In this chapter we illustrate the use of some Machine Learning techniques in the context of omics data. More precisely, we review and evaluate the use of Random Forest and Penalized Multinomial Logistic Regression for integrative analysis of genomics and immunomics in pancreatic cancer. Furthermore, we propose the use of association rules for predictive purposes to overcome the low predictive power of the previously mentioned models. Finally, we apply the reviewed methods to a real data set from TCGA comprising 107 tumoral pancreatic samples and 117,486 germline SNPs, showing the good performance of the proposed methods in predicting the immunological infiltration in pancreatic cancer.


MassFormer: Tandem Mass Spectrum Prediction for Small Molecules using Graph Transformers

Young, Adamo, Wang, Bo, Röst, Hannes

arXiv.org Artificial Intelligence

Tandem mass spectra capture fragmentation patterns that provide key structural information about a molecule. Although mass spectrometry is applied in many areas, the vast majority of small molecules lack experimental reference spectra. For over seventy years, spectrum prediction has remained a key challenge in the field. Existing deep learning methods do not leverage global structure in the molecule, potentially resulting in difficulties when generalizing to new data. In this work we propose a new model, MassFormer, for accurately predicting tandem mass spectra. MassFormer uses a graph transformer architecture to model long-distance relationships between atoms in the molecule. The transformer module is initialized with parameters obtained through a chemical pre-training task, then fine-tuned on spectral data. MassFormer outperforms competing approaches for spectrum prediction on multiple datasets, and is able to recover prior knowledge about the effect of collision energy on the spectrum. By employing gradient-based attribution methods, we demonstrate that the model can identify relationships between fragment peaks. To further highlight MassFormer's utility, we show that it can match or exceed existing prediction-based methods on two spectrum identification tasks. We provide open-source implementations of our model and baseline approaches, with the goal of encouraging future research in this area.